Lab 9: Bootstrap Sampling

library(ISLR)
library(boot)
library(ggplot2)

Class Examples

Example 1

From the Notes

alpha=function(X,Y){
  vx=var(X)    # sample variance of X
  vy=var(Y)    # sample variance of Y
  cxy=cov(X,Y) # sample covariance of X and Y
  (vy-cxy)/(vx+vy-2*cxy) # variance-minimizing allocation to X
}
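For reference, this function computes the estimator discussed in the notes (and in ISLR, Ch. 5): the fraction \(\alpha\) of funds invested in X that minimizes the variance of the portfolio \(\alpha X + (1-\alpha)Y\),

\[
\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}},
\]

where \(\hat{\sigma}_X^2\), \(\hat{\sigma}_Y^2\), and \(\hat{\sigma}_{XY}\) are the sample variances and covariance returned by var() and cov().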

head(Portfolio) #The Portfolio data set in the ISLR package
##            X          Y
## 1 -0.8952509 -0.2349235
## 2 -1.5624543 -0.8851760
## 3 -0.4170899  0.2718880
## 4  1.0443557 -0.7341975
## 5 -0.3155684  0.8419834
## 6 -1.7371238 -2.0371910
alpha(Portfolio$X,Portfolio$Y) #alpha hat
## [1] 0.5758321
## this is a simulation just to mimic the population

n=100
N <- 10000
alpha.sample <- numeric(N) #creating a numeric object

range(Portfolio$X)
## [1] -2.432764  2.460336
range(Portfolio$Y)
## [1] -2.725281  2.565985
# generating 10,000 simulated data sets from the "true" population. Assumptions:
# 1. The minimum and maximum values of X and Y in the Portfolio sample are the same as those of the population.
# 2. The population follows a uniform distribution.
# 2. Also assuming it's coming from a uniform distribution

for (i in 1:N)
{
  x <- runif(n,min(Portfolio$X),max(Portfolio$X)) 
  y <- runif(n,min(Portfolio$Y),max(Portfolio$Y))
  alpha.sample[i] <- alpha(x,y)
}

hist(alpha.sample, main = "Sampling distribution of alpha", col="orange") 
abline(v = mean(alpha.sample), col = "red", lty = 2)

This part was only trying to mimic the population that the Portfolio data came from.

Assumptions: 1. The minimum and maximum values of X and Y in the Portfolio sample are the same as those of the population. 2. The population follows a uniform distribution.

But in truth we only have one sample. So we need to use bootstrap sampling.

  1. Let’s run the bootstrap
dim(Portfolio)
## [1] 100   2
n=100 
N <- 10000
alpha.boot <- numeric(N) #creating a numeric object

for (i in 1:N)
{
  x <- sample(Portfolio$X, n, replace = TRUE)
  y <- sample(Portfolio$Y, n, replace = TRUE)
  alpha.boot[i] <- alpha(x,y)
}

hist(alpha.boot, main = "Bootstrap distribution of alpha", col="cadetblue2") 
abline(v = mean(alpha.boot), col = "red", lty = 2)

  2. What is the standard error of alpha?
mean(alpha.boot)
## [1] 0.5376162
sd(alpha.boot)
## [1] 0.04607754

Compare:

par(mfrow=c(1,3))

alpha <- data.frame(alpha.sample,alpha.boot)

hist(alpha.sample, main = "Sampling distribution of alpha", col="cadetblue2") 
abline(v = mean(alpha.sample), col = "red", lty = 2)

hist(alpha.boot, main = "Bootstrap distribution of alpha", col="orange") 
abline(v = mean(alpha.boot), col = "red", lty = 2)

boxplot(alpha)

#OR

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#install.packages("reshape2")                          
library("reshape2")
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
data_long <- melt(alpha)
## No id variables; using all as measure variables
head(data_long)
##       variable     value
## 1 alpha.sample 0.5606115
## 2 alpha.sample 0.5512288
## 3 alpha.sample 0.5709816
## 4 alpha.sample 0.4974464
## 5 alpha.sample 0.5061760
## 6 alpha.sample 0.5927827
data_long %>% 
  ggplot( aes(x = variable, y = value, fill=variable)) +
  geom_boxplot()

  3. Using the boot library

First, we create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.

To illustrate the use of the bootstrap on this data, we must first create a function, alpha.fn(), which takes as input the (X, Y ) data as well as a vector indicating which observations should be used to estimate \(\alpha\). The function then outputs the estimate for \(\alpha\) based on the selected observations.

This function returns, or outputs, an estimate for \(\alpha\) based on applying the formula to the observations indexed by the argument index.

#alpha.f=function(data,index){ #index=rows of your data set
 #with(data[index,], (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
#}
#alpha.f(Portfolio,1:100) 

# this function uses with(): evaluate the expression using the columns
# of the supplied data frame.

#OR

alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

For instance, the following command tells R to estimate \(\alpha\) using all 100 observations.

dim(Portfolio)
## [1] 100   2
alpha.fn(Portfolio,1:100) #this gives you the alpha for data index 1 to 100
## [1] 0.5758321

The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing \(\hat{\alpha}\) based on the new data set.

#one bootstrap sample

alpha.fn(Portfolio,sample(1:100,100,replace=T)) #a single bootstrap estimate; it varies from run to run
## [1] 0.5581076

We can implement a bootstrap analysis by performing this command many times, recording all of the corresponding estimates for \(\alpha\), and computing the resulting standard deviation. However, the boot() function automates this approach. Below we produce R = 1000 bootstrap estimates for \(\alpha\).

# Let's run the bootstrap
set.seed(1)
## when repeated many times, the estimates center near the original estimate

boot.out= boot(Portfolio,alpha.fn,R=1000) #repeat this 1000 times
boot.out
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.5758321 -0.001596422  0.09376093
plot(boot.out) #symmetric distribution, looks like gaussian

The final output shows that using the original data, \(\hat{\alpha} = 0.5758\), and that the bootstrap estimate for \(SE(\hat{\alpha})\) is 0.0938.
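The boot package can also turn this output directly into confidence intervals via boot.ci(); a minimal sketch, assuming the boot.out object created above:

```r
library(boot)  # provides boot.ci(), which operates on the object returned by boot()

# "norm": normal approximation, bias-adjusted estimate +/- 1.96 * bootstrap SE;
# "perc": the 2.5% and 97.5% quantiles of the 1000 bootstrap replicates.
boot.ci(boot.out, conf = 0.95, type = c("norm", "perc"))
```

These two interval types correspond to the normal-theory and percentile approaches used later in this lab.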

Example 2:

## Let's take a sample of size 25 from a standard normal distribution.
x0 <- rnorm(25)

## I'm going to take 3 bootstrap samples from the original sample
x.boot1 <- sample(x0, 25, replace = T)
x.boot2 <- sample(x0, 25, replace = T)
x.boot3 <- sample(x0, 25, replace = T)

## Let's compare the ecdf plots
ecdf.x0 <- ecdf(x0)
ecdf.x.boot1 <- ecdf(x.boot1)
ecdf.x.boot2 <- ecdf(x.boot2)
ecdf.x.boot3 <- ecdf(x.boot3)
xx <- seq(-3,3,by = .01)

#First let's plot the ecdf of the normal distribution (true distribution)
plot(xx,pnorm(xx), type = 'l', xlab = "x")

#ecdf of our original sample
lines(xx,ecdf.x0(xx), type = 's', col = 2, lwd = 4)
## this follows the true distribution, but with a few jumps (only 25 points).


#ecdf of the bootstrap samples
lines(xx,ecdf.x.boot1(xx), type = 's', col = 3, lwd = 2)
lines(xx,ecdf.x.boot2(xx), type = 's', col = 4, lwd = 2)
lines(xx,ecdf.x.boot3(xx), type = 's', col = 5, lwd = 2)

Example 3: Bangladesh

Arsenic is a naturally occurring element in the ground water of Bangladesh. However, much of this ground water is used for drinking water by rural populations, so arsenic poisoning is a serious health issue.

a. Plot the distribution of arsenic concentrations from 271 wells in Bangladesh.

Bangladesh <- read.csv("Bangladesh.csv")

head(Bangladesh)
##   Arsenic Chlorine Cobalt
## 1    2400      6.2   0.42
## 2       6    116.0   0.45
## 3     904     14.8   0.63
## 4     321     35.9   0.68
## 5    1280     18.9   0.58
## 6     151      7.8   0.35
Arsenic <- Bangladesh$Arsenic
hist(Arsenic) #Arsenic levels in 271 wells in Bangladesh.

length(Arsenic)
## [1] 271
  b. The sample mean and standard deviation are mean = 125.32 and sd = 297.98, respectively (measured in micrograms per liter).
mean(Arsenic)
## [1] 125.3199
sd(Arsenic)
## [1] 297.9755
  c. We draw resamples of size 271 with replacement from the data and compute the mean of each resample.
# bootstrap samples to estimate sample mean

n <- length(Arsenic)
N <- 10000

arsenic.mean <- numeric(N) 
for (i in 1:N)
{
  x <- sample(Arsenic, n, replace = TRUE)
  arsenic.mean[i] <- mean(x) #bootstrap sample mean
}
  d. Let’s plot the histogram and normal quantile plot of the bootstrap distribution.
hist(arsenic.mean, main = "Bootstrap distribution of means",col = '#E5CCFF') 
abline(v = mean(Arsenic), col = "red", lty = 2) # observed mean 

qqnorm(arsenic.mean)
qqline(arsenic.mean)

The bootstrap distribution looks quite normal, with some skewness. This is the Central Limit Theorem (CLT) at work—when the sample size is large enough, the sampling distribution for the mean is approximately normal, even if the population is not normal.

  e. Let’s calculate the bootstrap mean, bias, and standard error.
(boot.mean= mean(arsenic.mean)) # bootstrap mean
## [1] 125.1644
(bias= mean(arsenic.mean)-mean(Arsenic)) # estimated bias
## [1] -0.1555756
(boot.sd=sd(arsenic.mean)) # bootstrap SE
## [1] 17.96738

The mean of the bootstrap means is 125.1644, quite close to the sample mean (the difference is -0.1556). The bootstrap standard error is the standard deviation of the bootstrap distribution; in this case it is 17.9674.
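As a sanity check (a sketch, using the Arsenic vector loaded above), the bootstrap SE can be compared with the classical formula-based estimate of the standard error of a sample mean, \(s/\sqrt{n}\); the two should roughly agree:

```r
# Classical estimate of SE(x-bar): sample sd divided by sqrt(sample size).
# With s = 297.9755 and n = 271 (from the output above), this is about 18.10,
# close to the bootstrap SE of roughly 17.97.
sd(Arsenic) / sqrt(length(Arsenic))
```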

OR

# Let's run the bootstrap
set.seed(1)
## when repeated many times, the estimates center near the original estimate
mean.fn=function(data,ind){mean(data[ind])} # statistic: mean of the observations selected by the index

boot.out2= boot(Arsenic,mean.fn,R=1000) #repeat this 1000 times
boot.out2
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Arsenic, statistic = mean.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 125.3199 -0.3058926    17.78515
plot(boot.out2) #symmetric distribution, looks like gaussian

  f. Let’s calculate the \(95\%\) confidence intervals.

For the normal distribution, we know that the 2.5 and 97.5 percentiles are at the mean plus or minus 1.96 standard deviations.
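The 1.96 multiplier is simply the standard normal quantile, which can be checked directly in R:

```r
# 2.5% and 97.5% quantiles of the standard normal distribution:
qnorm(c(0.025, 0.975))
## [1] -1.959964  1.959964
```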

(LL=boot.mean-1.96*boot.sd)
## [1] 89.94828
(UL=boot.mean+1.96*boot.sd)
## [1] 160.3804
(a=sum(arsenic.mean > UL)/N)
## [1] 0.03
(b=sum(arsenic.mean < LL)/N)
## [1] 0.0179

But for this particular bootstrap distribution, we find that 3% of the resample means are above the bootstrap mean plus 1.96 SE, and 1.79% are below the bootstrap mean minus 1.96 SE. In this case, relying on the CLT would be inaccurate.

g. BOOTSTRAP PERCENTILE INTERVALS

quantile(arsenic.mean, c(0.025, 0.975))
##      2.5%     97.5% 
##  92.35348 162.00325

In the arsenic example, the \(2.5\%\) and \(97.5\%\) points of the bootstrap distribution give us the interval \((92.35, 162.00)\), so we are \(95\%\) confident that the true mean arsenic level is between 92.35 and 162.00 micrograms per liter. Relative to the bootstrap mean of 125.16, this interval can be written

92.35348-mean(arsenic.mean)
## [1] -32.81087
162.00325-mean(arsenic.mean)
## [1] 36.8389

\((\bar{x} - 32.81,\ \bar{x} + 36.84)\). In particular, this interval is not symmetric about the mean, reflecting the asymmetry of the bootstrap distribution.

The arsenic data illustrate an interesting point. A good confidence interval for the mean need not necessarily be symmetric: an endpoint will be farther from the sample mean in the direction of any outliers.

A confidence interval is an insurance policy: rather than relying on a single statistic, the sample mean, as an estimate of \(\mu\), we give a range of plausible values for \(\mu\).

BIAS

Example 5

##### Demonstration of biased estimator  ####

lambda = 2
n = 10 # sample size

# make a sample and estimate lambda

x.1 <- rexp(n,lambda)

1/mean(x.1) # lambda.hat: the maximum likelihood estimate of lambda
## [1] 1.23024
# make many samples and make a histogram

# true lambda in blue
# mean of estimates in red
# positive bias

lambda.0 <- replicate(10000,1/mean(rexp(n,lambda)))

bias.0 = mean(lambda.0) - lambda

hist(lambda.0, breaks = 40,prob= T, main = paste("Bias = ", round(bias.0,3)))
abline(v = lambda, col = 4, lwd = 3)
abline(v = mean(lambda.0), col = 2, lwd = 3)

Assuming we don’t have access to the population, we can resample from the original sample x.1.

# Assess this bias with the bootstrap

lambda.0.boot <- replicate(10000, 1/mean(sample(x.1,n,replace = T))) 
#this is resampling = bootstrap

# Bias estimate

(bias.boot= mean(lambda.0.boot) - 1/mean(x.1))
## [1] 0.08892219
hist(lambda.0.boot,breaks = 40,prob= T, main = paste("Bias = ", round(bias.boot,3)))
abline(v = lambda, col = 4, lwd = 3)
abline(v = mean(lambda.0), col = 2, lwd = 3)

Compare

par(mfrow=c(1,3))

lambda.hat <- data.frame(lambda.0,lambda.0.boot)

hist(lambda.hat[,1], main = "Sampling distribution of lambda", col="cadetblue2") 
abline(v = mean(lambda.hat[,1]), col = "red", lty = 2)

hist(lambda.hat[,2], main = "Bootstrap distribution of lambda", col="orange") 
abline(v = mean(lambda.hat[,2]), col = "red", lty = 2)

boxplot(lambda.hat)
abline(h=lambda,col=2)

Correct the estimate by subtracting the estimated bias from the original estimate.

1/mean(x.1) - bias.boot # original estimate minus estimated bias
## [1] 1.141318
### on average this correction counteracts the estimator's positive bias,
### though for this particular sample the estimate already sat below the true lambda = 2
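The same bias estimate can also be produced with the boot package. The sketch below (the names rate.fn and boot.out3 are just illustrative) generates its own sample, so its numbers will differ from the run above; the "bias" column of the printed output is mean(t*) - t0, matching the manual calculation:

```r
library(boot)

set.seed(1)
x.1 <- rexp(10, 2)   # one sample of size n = 10 from an Exp(rate = 2) population

# Statistic: the MLE of lambda computed on the resampled observations.
rate.fn <- function(data, index) 1 / mean(data[index])

boot.out3 <- boot(x.1, rate.fn, R = 10000)
boot.out3   # the "bias" column estimates the (positive) bias of 1/mean

# Bias-corrected estimate: original estimate minus estimated bias,
# i.e. 2 * t0 - mean(t*).
2 * boot.out3$t0 - mean(boot.out3$t)
```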

Two-sample Bootstrap

Example 4

A high school student was curious about the total number of minutes devoted to commercials during any given half-hour time period on basic and extended cable TV channels (Rodgers and Robinson (2004)).

TV = read.csv("TV.csv")
head(TV)
##   ID Times Cable
## 1  1   7.0 Basic
## 2  2  10.0 Basic
## 3  3  10.6 Basic
## 4  4  10.2 Basic
## 5  5   8.6 Basic
## 6  6   7.6 Basic
times.Basic <- TV$Times[TV$Cable == "Basic"]
#times.Basic <- subset(TV, select = Times,subset = Cable == "Basic", drop = T)


times.Ext <- TV$Times[TV$Cable == "Extended"]
#times.Ext <- subset(TV, select = Times,subset = Cable == "Extended", drop = T)

times.Basic
##  [1]  7.0 10.0 10.6 10.2  8.6  7.6  8.2 10.4 11.0  8.5
times.Ext
##  [1] 3.4 7.8 9.4 4.7 5.4 7.6 5.0 8.0 7.8 9.6

The means of the basic and extended channel commercial times are 9.21 and 6.87 min, respectively, so on average, commercials on basic channels are 2.34 min longer than on extended channels. Is this difference of 2.34 min statistically significant? The poor student could only stand to watch 10 h of random TV, so his observations may not accurately represent the TV universe.

mean(times.Basic)
## [1] 9.21
mean(times.Ext)
## [1] 6.87
mean(times.Basic)-mean(times.Ext)
## [1] 2.34
table(TV$Cable)
## 
##    Basic Extended 
##       10       10

The original data are simple random samples of size 10 from two populations.

We draw a bootstrap sample from the basic channel data and independently draw a bootstrap sample from the extended channel data, compute the means for each sample, and take the difference.

N <- 10000
times.diff.mean <- numeric(N) 

for (i in 1:N)
{
Basic.sample <- sample(times.Basic, 10, replace = TRUE) 
Ext.sample <- sample(times.Ext, 10, replace = TRUE)
times.diff.mean[i] <- mean(Basic.sample) - mean(Ext.sample)
}

hist(times.diff.mean,main = "Bootstrap distribution of difference in means",col = '#FFFF99')
abline(v = mean(times.Basic) - mean(times.Ext), col = "blue", lty = 2)

qqnorm(times.diff.mean)
qqline(times.diff.mean)

This displays the bootstrap distribution of the difference of sample means. As in the single-sample case, we see that the bootstrap distribution is approximately normal and centered at the original statistic (the difference in sample means).

We also get a quick idea of how much the difference in sample means varies due to random sampling. We may quantify this variation by computing the bootstrap standard error, which here is about 0.75. Again, the bootstrap standard error estimates the standard error of the sampling distribution.

mean(times.Basic) - mean(times.Ext) #original sample difference
## [1] 2.34
mean(times.diff.mean) #mean of the bootstrap distribution
## [1] 2.333684
sd(times.diff.mean)
## [1] 0.752676
# 95% CI
quantile(times.diff.mean, c(0.025, 0.975))
##  2.5% 97.5% 
##  0.88  3.81
mean(times.diff.mean) - (mean(times.Basic) - mean(times.Ext)) # bias
## [1] -0.006316

The \(95\%\) bootstrap percentile confidence interval for the difference in means (basic - extended) is (0.88, 3.81). Thus, we are \(95\%\) confident that commercial times on basic channels are, on average, between 0.88 and 3.81 min longer than on extended channels (per half-hour time period).
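The boot() function can handle the two-sample case directly through its strata argument, which resamples within each group. A sketch (the names diff.fn and boot.tv are illustrative; the TV data frame is rebuilt here from the values printed above so the chunk is self-contained):

```r
library(boot)

# Reconstruct the TV data from the values shown earlier in the lab.
TV <- data.frame(
  Times = c(7.0, 10.0, 10.6, 10.2, 8.6, 7.6, 8.2, 10.4, 11.0, 8.5,  # Basic
            3.4, 7.8, 9.4, 4.7, 5.4, 7.6, 5.0, 8.0, 7.8, 9.6),      # Extended
  Cable = rep(c("Basic", "Extended"), each = 10)
)

# Statistic: difference in mean times (Basic minus Extended)
# on the rows selected by the bootstrap index vector.
diff.fn <- function(data, index) {
  d <- data[index, ]
  mean(d$Times[d$Cable == "Basic"]) - mean(d$Times[d$Cable == "Extended"])
}

set.seed(1)
# strata = TV$Cable resamples the Basic and Extended rows separately,
# mimicking independent bootstrap samples from the two groups.
boot.tv <- boot(TV, diff.fn, R = 10000, strata = factor(TV$Cable))
boot.tv
boot.ci(boot.tv, type = "perc")
```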

Hypothesis testing using the bootstrap approach

(Chihara and Hesterberg, 2011)

Example 6

Verizon is the primary local telephone company (incumbent local exchange carrier, ILEC) for a large area of the eastern United States. As such, it is responsible for providing repair service for the customers of other telephone companies, known as competing local exchange carriers (CLECs), in this region. Verizon is subject to fines if the repair times (the time it takes to fix a problem) for CLEC customers are substantially worse than those for Verizon customers.

The data set Verizon contains a random sample of repair times for 1664 ILEC and 23 CLEC customers. The mean repair time for ILEC customers is 8.4 hours, while that for CLEC customers is 16.5 hours.

Could a difference this large be easily explained by chance? Or is there another reason?

Verizon = read.csv("Verizon.csv")
head(Verizon)
##    Time Group
## 1 17.50  ILEC
## 2  2.40  ILEC
## 3  0.00  ILEC
## 4  0.65  ILEC
## 5 22.23  ILEC
## 6  1.20  ILEC
tapply(Verizon$Time, Verizon$Group, mean)
##      CLEC      ILEC 
## 16.509130  8.411611

Let’s state a hypothesis: are the two population means different?

Null Hypothesis: The difference of the means is equal to 0. (That is, there is no difference between the two population means - no discrimination between the two groups of customers.)

Let’s use the bootstrap approach.

  1. First let’s separate the two samples, “CLEC” and “ILEC”.
verizon.clec <- Verizon$Time[Verizon$Group == "CLEC"]
verizon.ilec <- Verizon$Time[Verizon$Group == "ILEC"]
  2. Make many (10,000) bootstrap samples of the sample mean for each group. Record the differences in another variable. What is the average of the differences? Compare this with the actual mean difference.
z.1 <- rep(NA,10000)

for (j in 1:10000){
  boot.ilec <- mean(sample(verizon.ilec, length(verizon.ilec), replace = T))
  boot.clec <- mean(sample(verizon.clec, length(verizon.clec), replace = T))
  z.1[j] <- boot.ilec - boot.clec #the difference
}

mean(z.1) #bootstrap mean difference
## [1] -8.091132
mydiff = function(mydf){
  index = mydf$Group == "ILEC"
  return(mean(mydf$Time[index]) - mean(mydf$Time[!index]))
}

mydiff(Verizon) #actual mean difference from the original sample
## [1] -8.09752

The mean of the bootstrap distribution is close to the actual difference of means.

hist(z.1, breaks = 40, prob = T, main = "Histogram of bootstrap sample means",col = '#99FF99')
abline(v = mydiff(Verizon), col = 2, lwd = 2)

  3. Make a 95% percentile confidence interval for the difference estimate.

ci.1 <- quantile(z.1, c(.025, .975))
ci.1 #the null value is not inside the confidence interval => reject the null hypothesis
##       2.5%      97.5% 
## -16.726885  -1.748643
hist(z.1, breaks = 40, prob = T, main = "Histogram of bootstrap sample means")
abline(v = mydiff(Verizon), col = 2, lwd = 2)
abline(v = ci.1, col = 4, lwd = 2)

The CI does not contain 0, and the true difference is likely negative.

That means we reject the null hypothesis at the 5% significance level: the observed difference is unlikely to be due to chance alone, and may reflect a systematic difference between repair times for Verizon’s own customers and those of the competing companies’ customers.